-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TOPI][GPU] Mergepath sort with odd-even block sort #7611
Conversation
@mbrookhart This doesn't work on VK/SPIR-V because SPIR-V requires a thread block size to be a compile time constant, see tvm/src/target/spirv/codegen_spirv.cc Line 106 in d7f5753
In particular, the following size selection has an issue, because tvm/python/tvm/topi/cuda/sort.py Lines 335 to 337 in d2940ca
So to workaround this problem, can we either
? |
Can we just run the full thread block size and nop on the extra threads? |
## Perhaps we can autotune? | ||
block_size = 128 | ||
thread_work = 4 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @merrymercy can we use autotvm for this? i.e. a traditional auto tuner way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any tips on how to do it?
@mbrookhart Can you comment on the difference in the way two-level merge path is done, between your implementation and
|
a7c8c62
to
c3aa1c6
Compare
@masahi I was having a hard time getting partition information back out when I used a seperate kernel, mostly due to handling memory in IR builder. Running the binary search in the kernel that will eventual merge the partitions was easier to handle from a memory-passing sense. I could go back and try temporary allocations to see if the added complexity gets us any performance improvement. Still thinking about the threading for vulkan. |
There is an issue with dynamic topk? I saw the same failure yesterday |
I've got a dynamic topk failure, but I can't reproduce it locally, that test passes on my version of dependencies. Maybe I can run the CI docker locally |
@masahi I pushed the change to vulkan threading, could you test that on your GPU? I build the CI docker on my machine and attempting to reproduce the dynamic topk failure, I still was unable to. I'm not sure where to go next. |
On VK, it fails on (122640, 1) workload. rocm works fine.
|
I validated that that threading change passed tests when I applied it to cuda, it just hurt performance :( I wonder if I'm hitting another spirv codegen issue? |
yeah the same fixed-size threading applied to rocm works fine, so this is definitely a spirv issue. I'll see what's going on. |
Interesting! I did a binary search for which size fails on VK, size below 2048 works fine, but starting from 2049 it fails. |
@mbrookhart Also on the failure from On the old sort IR, rocm works fine with empty input. VK still gives segfault, that's another story :) |
@masahi that might be the point at which I switch from single level mergepath to dual level mergepath. I wonder if vulkan is having trouble with the extra control flow there. |
interesting, is it possible to always force single level? |
I confirmed that this sort also works with VK / SPIRV backend with the following great results:
|
Thanks @mbrookhart |
* Mergepath sort with odd-even block sort * fix lint, add test * respond to review comments * speed up tests by reducing dtype skews * fix bad rebase * change threading to support vulkan * fix lint * only sort if the data is non-empty * fix lint again * fix for vk * move if to higher scope * fix typo Co-authored-by: Masahiro Masuda <masahi@129@gmail.com>
* Mergepath sort with odd-even block sort * fix lint, add test * respond to review comments * speed up tests by reducing dtype skews * fix bad rebase * change threading to support vulkan * fix lint * only sort if the data is non-empty * fix lint again * fix for vk * move if to higher scope * fix typo Co-authored-by: Masahiro Masuda <masahi@129@gmail.com>
Building on @masahi's work to add a while loop to TIR, this PR implements the MergePath algorithm to parallelize the later stages of mergesort. It also implements a stable in-shared-memory odd-even sort to do small block sorting before mergesort for better speed. All told, this optimization gives us dramatic speedups on both AMD and Nvidia GPUs and trades blows with our thrust implementation for many shapes, see below. cc @Laurawly @zhiics @icemelon9 @csullivan @tkonolige
I had to add a path for skipping thread planning to get handed-threaded odd-even-tranpose sort to work, cc @tqchen @junrushao1994